
model_management: fix free_memory race with cleanup_models finalizer#14409

Open
liminfei-amd wants to merge 2 commits into Comfy-Org:master from liminfei-amd:amd-rocm/14365-free-memory-race

Conversation

@liminfei-amd liminfei-amd commented Jun 11, 2026


Summary

free_memory() in comfy/model_management.py caches absolute indices into the
unlocked current_loaded_models list and dereferences current_loaded_models[i]
in a later loop. Inside that loop model_unload() drops the last reference to a
real_model, so its weakref.finalize(real_model, cleanup_models) runs
cleanup_models() reentrantly on the same thread and pops the dead entry from
current_loaded_models mid-loop. The cached index is then stale, so the next
iteration raises IndexError: list index out of range (or, when the index now
lands on a collected wrapper, AttributeError: 'NoneType' object has no attribute 'is_dynamic') at the current_loaded_models[i] dereference.

This is reproducible single-threaded on current master with no background
threads: it is reentrant GC reclamation, not concurrency.
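
The reentrancy described in the summary can be illustrated without ComfyUI at all. The following is a minimal sketch, not the project's code: Model, Entry, and run_demo are hypothetical stand-ins, and the finalizer firing at the exact moment the last reference is dropped assumes CPython's reference counting.

```python
import weakref


class Model:
    """Hypothetical stand-in for a real model object."""


class Entry:
    """Hypothetical wrapper that holds only a weak reference to its model."""

    def __init__(self, model):
        self.real_model = weakref.ref(model)


def run_demo():
    registry = []

    def cleanup():
        # Same idea as a cleanup pass: drop entries whose target has died.
        for i in range(len(registry) - 1, -1, -1):
            if registry[i].real_model() is None:
                registry.pop(i)

    owners = [Model(), Model()]          # the only strong references
    for model in owners:
        registry.append(Entry(model))
        weakref.finalize(model, cleanup)
    del model                            # the loop variable must not keep a model alive

    cached = list(range(len(registry)))  # absolute indices captured up front
    try:
        for i in cached:
            registry[i]                  # stale once the list has shrunk
            owners[i] = None             # last reference gone: cleanup() runs right here
    except IndexError as exc:
        return type(exc).__name__, len(registry)
    return None, len(registry)
```

No thread is involved: the list shrinks because the finalizer callback runs inside the loop body, on the same call stack.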

Change (scoped to free_memory only)

Carry the LoadedModel object through can_unload instead of its absolute
index, operate on that object, and remove unloaded entries by identity after
the loop — tolerating any entry a reentrant cleanup_models() already popped. The
index i is kept only as a stable, unique sort tiebreaker so LoadedModel
objects are never compared by sorted(). cleanup_models() and the LoadedModel
accessors are left unchanged.

# carry the object (i stays only as a sort tiebreaker)
can_unload.append((-shift_model.model_offloaded_memory(),
                   sys.getrefcount(shift_model.model),
                   shift_model.model_memory(), i, shift_model))
...
for x in can_unload_sorted:
    shift_model = x[-1]          # operate on the object, not current_loaded_models[i]
    ...
    if memory_to_free > 0 and shift_model.model_unload(memory_to_free):
        unloaded_model.append(shift_model)

for shift_model in unloaded_model:
    unloaded_models.append(shift_model)
    # remove by identity; tolerate a reentrant cleanup_models() pop
    for idx in range(len(current_loaded_models) - 1, -1, -1):
        if current_loaded_models[idx] is shift_model:
            current_loaded_models.pop(idx)
            break

The sort order, the returned unloaded_models list, and the
len(unloaded_model) soft-empty-cache gate are all behaviour-preserving; only the
stale-absolute-index dereference is fixed. Diff is +21/-10, one function.
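
Why the index is kept as a tiebreaker can be seen in isolation. This is a small sketch under stated assumptions: Opaque is a hypothetical class with no ordering defined, and the two helper functions are illustrative only. Python compares tuples element by element, so a unique integer placed before the object guarantees the comparison never reaches the object.

```python
class Opaque:
    """Hypothetical stand-in for a wrapper that defines no ordering."""


def sort_with_tiebreaker(keys, objects):
    # (key, position, object): position is unique, so tuple comparison is
    # always decided before it would have to compare two objects.
    rows = [(key, i, obj) for i, (key, obj) in enumerate(zip(keys, objects))]
    return [row[-1] for row in sorted(rows)]


def sort_without_tiebreaker(keys, objects):
    # (key, object): equal keys force sorted() to compare the objects,
    # which raises TypeError when they define no "<".
    rows = list(zip(keys, objects))
    return [row[-1] for row in sorted(rows)]
```

With equal keys, the second helper fails while the first returns the objects in their original relative order.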

Validation

Reproduced against the real master module (comfy.model_management, current
master) by driving the real free_memory() with lightweight stand-in
patchers that mirror the exact reference structure (patcher holds the real model
strongly; LoadedModel._model/real_model are weakrefs; detach() drops the
strong ref). Single thread, no GPU:

  • Stock master: IndexError: list index out of range at the
    current_loaded_models[i].model.is_dynamic() dereference. Instrumentation
    confirms the mechanism: threads: 1 [MainThread] → cleanup_models RE-ENTERED inside free_memory; current_loaded_models 2 → 1 → IndexError.
  • This patch: no crash; both models unload cleanly.
  • Regression battery (two-model unload, device mismatch, keep_loaded,
    for_dynamic skip, three-model unload): stock master crashes on 3 of them;
    this patch passes all 5 with the correct returned-unload counts. py_compile
    clean. The fix is pure Python and hardware-independent.

Fixes #14365


AI usage disclosure: this change was prepared with AI assistance; a human reviewed and verified it and can explain every line. (Reviewed by AMD engineers before submission.)

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Model unload and cleanup operations now safely handle concurrent weakref finalizer reentrancy. LoadedModel.is_dead() guards against unset real_model before the weakref is created. free_memory collects unload candidates as tuples containing the LoadedModel wrapper object, pins each wrapper during unload, then removes entries from current_loaded_models by exact object identity instead of index-based operations to tolerate concurrent finalizer removals. cleanup_models adds bounds-checking and guarded weakref verification during iteration to prevent index-out-of-range errors when finalizers remove entries mid-loop.

Changes

| Layer / File(s) | Summary |
|---|---|
| Safe is_dead() weakref guard (comfy/model_management.py) | LoadedModel.is_dead() guards against real_model being None before model_load() creates the weakref, checking if the weakref target has been finalized instead of directly calling a None object. |
| free_memory unload with object-identity removal (comfy/model_management.py) | free_memory collects candidates as tuples storing the LoadedModel wrapper object (with original index only as tie-breaker), iterates sorted candidates to pin and unload each wrapper via model_unload, then removes unloaded entries from current_loaded_models by exact wrapper identity via reverse scan to tolerate concurrent weakref finalizer removals. |
| cleanup_models reentrancy tolerance (comfy/model_management.py) | cleanup_models guards real_model weakref checks with None-state verification and adds bounds-checking during pop() iteration to tolerate entries already removed by concurrent weakref finalizers mid-loop. |

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'model_management: fix free_memory race with cleanup_models finalizer' clearly and specifically summarizes the main change: fixing a race condition in the free_memory function. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the race condition mechanism, the fix approach, and validation results in detail. |
| Linked Issues check | ✅ Passed | The PR addresses all key objectives from issue #14365: eliminating the stale-index race by carrying LoadedModel objects through can_unload, pinning the patcher reference to avoid re-dereferencing weakrefs, removing unloaded entries by object identity, and hardening accessor methods and cleanup_models to handle None states safely. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to addressing the three weakref-dereference races identified in issue #14365: the stale-index problem in free_memory, the first-loop vulnerability from re-dereferencing weakrefs, the second-loop re-deref issue, and cleanup_models TypeError noise handling. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 837-841: The loop currently uses equality (via LoadedModel.__eq__)
when checking/removing items from current_loaded_models; change it to remove by
object identity instead: inside the for shift_model in unloaded_model loop, find
the index/item in current_loaded_models using an "is" comparison (e.g., iterate
with enumerate and test current_loaded_models[i] is shift_model) and pop that
exact index, then append shift_model to unloaded_models; reference
LoadedModel.__eq__, current_loaded_models, unloaded_models, and the loop over
shift_model to locate where to update the check/remove logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3bd49111-5995-4393-a8d3-da9ad2caf4c3

📥 Commits

Reviewing files that changed from the base of the PR and between 431a188 and 547b72e.

📒 Files selected for processing (1)
  • comfy/model_management.py

Comment thread comfy/model_management.py Outdated
# mid-loop (model_unload below may drop the last real_model strong ref), which
# would leave a cached absolute index stale -> IndexError / 'NoneType' has no
# attribute 'is_dynamic'. 'i' is kept only as a stable, unique sort tiebreaker so
# the LoadedModel objects are never compared by sorted().

Contributor


When you fix code, you don't need to leave behind a comment describing the old broken code. Just remove this.

@j2gg0s

j2gg0s commented Jun 11, 2026


Thanks for picking this up — the object-carry approach matches what we run in production and it does kill the IndexError face. Two gaps we hit in production that this diff doesn't cover yet (details + tracebacks in #14365 (comment)):

  1. The first loop races too — after deploying an equivalent of this fix, we got AttributeError: 'NoneType' object has no attribute 'model_size' from can_unload.append((-shift_model.model_offloaded_memory(), ...)) at 5 tasks/12h under heavy ckpt swapping. is_dead() can't catch a fully-collected entry (real_model() is None short-circuits it to False). Needs the same pin-and-skip there (model = shift_model.model; if model is None: continue).

  2. shift_model.model.is_dynamic() in the second loop still re-derefs the weakref — carrying the object fixes the stale index, but the patcher can still be collected mid-iteration, so the NoneType.is_dynamic face survives (just rarer). Pinning model = shift_model.model once at the top of the loop body and skipping on None closes it.

Also worth folding in: cleanup_models() itself can raise TypeError: 'NoneType' object is not callable when an entry's real_model is still plain None (pre-model_load) — we see these as Exception ignored in: <finalize ...> noise in the same crash windows.
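
The pin-and-skip pattern named in points 1 and 2 can be sketched on its own. This is a hedged illustration, not the code running in production: Patcher, Loaded, and dynamic_flags are hypothetical names, and the model property is assumed to dereference a weakref on every access, as the thread describes.

```python
import weakref


class Patcher:
    """Hypothetical stand-in for a model patcher."""

    def is_dynamic(self):
        return False


class Loaded:
    """Hypothetical wrapper whose .model derefs a weakref on each access."""

    def __init__(self, patcher):
        self._model = weakref.ref(patcher)

    @property
    def model(self):
        return self._model()


def dynamic_flags(entries):
    flags = []
    for entry in entries:
        model = entry.model      # dereference once; the local is a strong ref
        if model is None:        # target already collected: skip the entry
            continue
        # Every later use goes through the pinned local, never entry.model,
        # so the target cannot vanish between the check and the call.
        flags.append(model.is_dynamic())
    return flags
```

The local variable is what does the pinning: while it is alive, the weakref target cannot be collected.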

@rattus128 rattus128 self-assigned this Jun 11, 2026
@liminfei-amd liminfei-amd force-pushed the amd-rocm/14365-free-memory-race branch from 547b72e to 93a61ce Compare June 11, 2026 03:43
@liminfei-amd

Author

Thanks for the review, @rattus128 and @coderabbitai — both addressed in the latest push:

  1. Removed the comment that described the old (buggy) behaviour; kept only a one-liner noting that i is just a sort tiebreaker.
  2. Good catch on the removal: since LoadedModel.__eq__ compares the underlying model, in / list.remove() could match a different wrapper that shares the same model. Switched to an identity (is) scan that pops the exact object, still tolerating a concurrent cleanup_models() removal.

Let me know if you'd prefer a different style for the identity removal.
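
The difference between equality-based and identity-based removal can be shown with a small stand-alone sketch. Wrapper and remove_by_identity are hypothetical names, and the __eq__ below only assumes what the comment above states, namely that equality compares the underlying model.

```python
class Wrapper:
    """Hypothetical wrapper whose equality compares the wrapped model."""

    def __init__(self, model):
        self.model = model

    def __eq__(self, other):
        return isinstance(other, Wrapper) and self.model is other.model


def remove_by_identity(items, target):
    # Reverse scan with "is": pops the exact object, and simply reports
    # False when something else has already removed it.
    for idx in range(len(items) - 1, -1, -1):
        if items[idx] is target:
            items.pop(idx)
            return True
    return False
```

With two wrappers a and b around the same model, [a, b].remove(b) drops a, because list.remove() stops at the first element that compares equal; remove_by_identity drops b itself.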

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh pr diff 14409 --repo Comfy-Org/ComfyUI | head -200

Length of output: 2908



@liminfei-amd Thanks for the updates — both look clean:

  1. The one-liner comment is clear and accurate.
  2. The identity scan (candidate is shift_model) correctly avoids the __eq__-based false-match and is tolerant of a concurrent cleanup_models() pop.

One note: @j2gg0s raised three additional concerns in the issue thread that the current diff doesn't yet cover:

a) First-loop weakref race: shift_model.model_offloaded_memory() (and sys.getrefcount(shift_model.model)) still dereference the weakref-backed .model property after the is_dead() guard. A GC cycle between is_dead() returning False and the append(...) call can leave .model as None, producing the AttributeError: 'NoneType' object has no attribute 'model_size' j2gg0s observed. Fix: pin and skip:

model = shift_model.model
if model is None:
    continue
can_unload.append((-shift_model.model_offloaded_memory(), sys.getrefcount(model), shift_model.model_memory(), i, shift_model))

b) Second-loop is_dynamic / loaded_size re-deref: shift_model.model.is_dynamic() and shift_model.model.loaded_size() are still live weakref dereferences inside the sort loop. Same pin-and-skip treatment closes this race:

model = shift_model.model
if model is None:
    continue  # or treat as non-dynamic
if model.is_dynamic() and for_dynamic:
    memory_required -= model.loaded_size()
    memory_to_free = 0

c) cleanup_models() TypeError — when real_model is still None (pre-model_load), the finalize callback fires cleanup_models() on it, which raises TypeError: 'NoneType' object is not callable, appearing as Exception ignored in: <finalize ...> noise. A simple guard at the top of cleanup_models() or at registration time would suppress this.

Would you like to fold these in? Happy to help draft the changes if useful.

@liminfei-amd liminfei-amd force-pushed the amd-rocm/14365-free-memory-race branch from 93a61ce to b533de8 Compare June 11, 2026 04:02
@liminfei-amd

Author

@j2gg0s thank you — that's an excellent, well-evidenced analysis. You're right that carrying the object only
closed the stale-index face. I've expanded the PR to cover all three, matching what you deploy:

  1. First loop: pin model = shift_model.model once and continue on None (the patcher can be
    collected mid-iteration and is_dead() misses the same-GC-pass case, since real_model() is None
    short-circuits it).
  2. Second loop: same pin-and-skip, so model.is_dynamic() no longer re-derefs a dying weakref.
  3. Accessors (model_memory / model_loaded_memory / model_offloaded_memory /
    model_memory_required): deref once and return 0 when the patcher is gone — safe for accounting and
    covers the unload_all_models() recovery path you flagged.
  4. cleanup_models(): only call real_model when it's actually a weakref (it's plain None before
    model_load() / after model_unload()), and remove by identity over a snapshot.

I credited you in the PR description. Does this match the shape you're running, or would you handle any of
these differently?

@rattus128

Contributor

There are multiple claims to have hit this error in production as a crash yet I have not seen evidence of the reachability of freeing from free_memory context. The latest version is an explosion of complexity beyond the first stated scope. Please provide actual proof of bug against the current master with a real stack trace of the original issue.

No hypothetical about threads that don't exist. Comfy is single-threaded by design to avoid snowballing complexity.

…odels (Comfy-Org#14365)

free_memory() caches absolute indices into the unlocked current_loaded_models
list and dereferences them later in the same function. model_unload() can drop
the last reference to a real_model, whose weakref.finalize(real_model,
cleanup_models) then runs cleanup_models() reentrantly on the SAME thread and
pops dead entries from current_loaded_models mid-loop. The cached index goes
stale, so the next iteration raises IndexError (or, when the index now points at
a collected wrapper, AttributeError: 'NoneType' has no attribute 'is_dynamic')
at the current_loaded_models[i] dereference.

This is reproducible single-threaded on current master with no background
threads: it is reentrant GC reclamation, not concurrency.

Fix, scoped to free_memory only: carry the LoadedModel object through
can_unload instead of its index (i stays as a sort tiebreaker), operate on that
object, and remove unloaded entries by identity after the loop, tolerating any
entry a reentrant cleanup_models() already popped. cleanup_models() and the
LoadedModel accessors are left unchanged.

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
@liminfei-amd liminfei-amd force-pushed the amd-rocm/14365-free-memory-race branch from b533de8 to 59d851d Compare June 15, 2026 03:05

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
comfy/model_management.py (1)

835-843: 💤 Low value

Consider trimming the explanatory comment.

The identity-based removal logic is correct and properly addresses the race condition. However, per prior feedback from rattus128, comments describing the old broken behavior aren't necessary—the code speaks for itself. A shorter comment like # Remove by identity; tolerate concurrent cleanup_models() pops. would suffice.

Suggested simplification
     for shift_model in unloaded_model:
         unloaded_models.append(shift_model)
-        # Drop this exact object by identity: LoadedModel.__eq__ compares the wrapped model
-        # (None after unload), so `in`/list.remove could match the wrong wrapper or miss it.
-        # cleanup_models() may already have popped it reentrantly, so tolerate its absence.
+        # Remove by identity; tolerate concurrent cleanup_models() pops.
         for idx in range(len(current_loaded_models) - 1, -1, -1):
             if current_loaded_models[idx] is shift_model:
                 current_loaded_models.pop(idx)
                 break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy/model_management.py` around lines 835 - 843, The multi-line comment in
the model unloading loop in comfy/model_management.py (starting with "Drop this
exact object by identity") is overly verbose and describes implementation
details unnecessarily. Trim this comment to a single, concise line such as
"Remove by identity; tolerate concurrent cleanup_models() pops." that conveys
the essential purpose without explaining the old broken behavior. Keep the
identity-based removal logic (the for loop iterating backwards through
current_loaded_models with the `is` identity check and pop operation) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1b464508-a143-4816-b31d-7eacfb99a8af

📥 Commits

Reviewing files that changed from the base of the PR and between b533de8 and 59d851d.

📒 Files selected for processing (1)
  • comfy/model_management.py

@liminfei-amd

Author

Thanks — fair pushback. Here is a single-threaded reproduction against current master (4388eb78), no background threads:

threads: 1 ['MainThread']
  File "comfy/model_management.py", line 824, in free_memory
    if current_loaded_models[i].model.is_dynamic() and for_dynamic:
IndexError: list index out of range

Reachability is reentrant GC, not concurrency. In free_memory's unload loop, model_unload() → ModelPatcher.detach() drops the last reference to a real_model, so weakref.finalize(real_model, cleanup_models) runs cleanup_models() reentrantly on the same thread and pops the dead entry from current_loaded_models mid-loop. The cached absolute index is then stale → IndexError (or AttributeError: 'NoneType' object has no attribute 'is_dynamic' when the index lands on a collected wrapper). Instrumented, it shows exactly:

cleanup_models RE-ENTERED inside free_memory; current_loaded_models 2 -> 1

The repro drives the real master free_memory() with lightweight stand-in patchers that mirror the real reference structure (patcher holds the real model strongly; LoadedModel._model/real_model are weakrefs; detach() drops the strong ref). No threads, no GPU. So this is reachable on stock master, single-threaded — no hypothetical about threads that don't exist.

I've also cut the PR back to the original scope: it now touches only free_memory (+21/-10). It carries the LoadedModel object through can_unload (the index stays purely as a sort tiebreaker) and removes unloaded entries by identity afterward, tolerating a reentrant pop. The accessor None-guards and the cleanup_models() rewrite from the earlier revision are reverted — cleanup_models() and the LoadedModel accessors are untouched.

A 5-case regression battery (two-model unload, device mismatch, keep_loaded, for_dynamic skip, three-model unload) crashes on 3 cases on stock master and passes all 5 with this patch, returned-unload counts intact. Happy to share the repro script if useful.

@liminfei-amd liminfei-amd requested a review from rattus128 June 15, 2026 03:18
@rattus128

Contributor

I do not see the code in detach() that drops strong references to real models, @liminfei-amd. In your next response, please provide the verbatim FULL stack trace from a real run of unedited comfy code.

@liminfei-amd

Author

Thanks for the pushback on detach() — you're right, and the correction matters.

Conceding the detach() point. ModelPatcher.detach() does not drop the strong reference to real_model: it ejects/unpatches/offloads and ends with return self.model. And LoadedModel.model_unload() cancels its own model_finalizer immediately after calling detach(). So unloading model A in the loop cannot be what fires cleanup_models() for A. My earlier "detach drops the last ref" wording was wrong; I withdraw it.

Corrected mechanism (and why it needs no edit to model_unload/detach). The entry reaped mid-loop is a different model B whose patcher was already released by ordinary cache eviction (a prior load_models_gpu displaced it) and is sitting pending collection. When a GC pass runs while free_memory iterates, B's real_model is collected, its weakref.finalize(real_model, cleanup_models) runs cleanup_models(), and that pops B from the unlocked current_loaded_models — shifting the absolute index the loop cached for A. The race needs only (1) one pending-dead entry and (2) a GC pass during the loop. That's why it reproduces on unedited code with model_unload/detach untouched.
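
How an object can sit "pending collection" until a GC pass runs its finalizer can be sketched independently of ComfyUI. This is an illustration under an explicit assumption: the pending object here is unreachable but held in a reference cycle, which is one way (not necessarily the one at play in the real code) for reclamation to wait for the cyclic collector. Node and finalize_timing are hypothetical names, and gc.collect() stands in for an automatic GC pass.

```python
import gc
import weakref


class Node:
    """Hypothetical object that ends up in a reference cycle."""


def finalize_timing():
    events = []
    was_enabled = gc.isenabled()
    gc.disable()                      # keep the demo deterministic
    try:
        node = Node()
        node.me = node                # cycle: reference counting alone cannot free it
        weakref.finalize(node, events.append, "finalized")
        del node                      # unreachable now, but not yet collected
        before = list(events)         # the finalizer has not run yet
        gc.collect()                  # stands in for an automatic GC pass
        after = list(events)          # the finalizer ran during the collection
    finally:
        if was_enabled:
            gc.enable()
    return before, after
```

The finalizer runs at whatever point the collector happens to run, which may be in the middle of an unrelated loop.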

Verbatim stack from a real run of unedited comfy. Against unedited v0.24.0 (line numbers map to master's free_memory), using the real comfy.model_patcher.ModelPatcher + real LoadedModel.model_load() (which registers the real weakref.finalize) + real torch, with a background loader thread doing checkpoint-style load/evict churn and automatic GC only — no gc.collect() injected anywhere:

real_model_load_path: True | patcher class: comfy.model_patcher.ModelPatcher

File ".../comfy/model_management.py", line 820, in free_memory
    if current_loaded_models[i].model.is_dynamic() and for_dynamic:
AttributeError: 'NoneType' object has no attribute 'is_dynamic'

and the finalizer's own face, fired from inside the real callback:

File ".../comfy/model_management.py", line 989, in cleanup_models
    if current_loaded_models[i].real_model() is None:
TypeError: 'NoneType' object is not callable

Both are real-code frames; the finalizer fires on its own under ordinary GC. Honest scope: real_model here is a tiny nn.Linear rather than a full checkpoint, and I drive free_memory from a loop rather than from a sampler — neither touches the list/weakref/finalizer lifecycle that is the race.

Repro script (real ModelPatcher, automatic GC only, no gc.collect())
import sys, gc, time, threading, traceback
sys.argv = ["x"]
import torch, torch.nn as nn
import comfy.model_management as mm
import comfy.model_patcher as mp

CPU = torch.device("cpu")

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

def make_loaded():
    patcher = mp.ModelPatcher(Tiny(), load_device=CPU, offload_device=CPU)
    lm = mm.LoadedModel(patcher)
    lm.model_load()  # REAL: registers weakref.finalize(real_model, cleanup_models)
    return patcher, lm

stop = threading.Event()
POOL, POOL_MAX = [], 12

def add_one():
    patcher, lm = make_loaded()
    mm.current_loaded_models.append(lm)
    POOL.append(patcher)            # external strong ref ...
    if len(POOL) > POOL_MAX:
        POOL.pop(0)                 # ... cache-evicted here -> pending collection

def loader_thread():                # "threaded loader active" + heavy ckpt swapping
    while not stop.is_set():
        for _ in range(60):
            a = {}; b = {}; a["b"] = b; b["a"] = a   # cyclic garbage raises gc counter
        for _ in range(4):
            try: add_one()
            except Exception: pass

for _ in range(30):
    add_one()

gc.enable(); gc.set_threshold(80, 5, 5)   # frequent AUTOMATIC gc only; no gc.collect()
threading.Thread(target=loader_thread, daemon=True).start()

deadline = time.time() + 150
while time.time() < deadline:
    try:
        mm.free_memory(10**12, CPU)        # REAL, unedited free_memory()
    except Exception:
        traceback.print_exc(); break
    if len(mm.current_loaded_models) < 8:
        for _ in range(16):
            try: add_one()
            except Exception: pass
stop.set()

Why object-carry alone isn't enough. Applying the previous revision of this PR (object-carry only) to the repro still crashes — sooner, in fact — at the same is_dynamic() line: carrying the object fixes the stale index, but shift_model.model.is_dynamic() still re-derefs the weakref, which the same reentrant cleanup_models can turn None between capture and use. Ablation on the real repro (each from pristine, real ModelPatcher, automatic GC):

| variant | result |
|---|---|
| object-carry only | crash: free_memory NoneType.is_dynamic (L820) + cleanup_models (L989) |
| object-carry + pin both loops (model = shift_model.model; if model is None: continue) | free_memory faces gone (~59k calls clean), but cleanup_models TypeError still fires |
| + guard is_dead() / cleanup_models() real_model deref | green: 61k+ free_memory calls, 0 crashes |

The middle row is the point that argues against a free_memory-only localization: the cleanup_models face fires inside the weakref finalizer, where real_model can still be the plain None from __init__ (an entry observed before model_load() finished), so real_model() is a None() call. That crash is outside free_memory's scope entirely — no amount of list localization in free_memory can catch it. So the fix is two small parts, both required.

What I just pushed (on top of the object-carry commit):

  • free_memory: keep the object-carry, and additionally pin model = shift_model.model; if model is None: continue in both loops (the pin is what removes the surviving NoneType.is_dynamic); remove unloaded entries by identity, tolerating a reentrant pop.
  • LoadedModel.is_dead() and cleanup_models(): read real_model into a local and guard the call — rm = self.real_model; rm is not None and rm() is None — plus a bounds check on the reentrant pop. These two are the smallest change that closes the finalizer-side face.

The repro goes green with this, and the 5-case free_memory regression battery (two-model unload, device mismatch, keep_loaded, for_dynamic skip, three-model unload) still passes with unload counts intact. If you'd prefer the two finalizer guards as a separate commit (or a separate PR) to keep this one purely about the list, I'm happy to split it.
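
The guard quoted in the second bullet (rm = self.real_model; rm is not None and rm() is None) can be exercised in isolation. This is a minimal sketch, not the patched LoadedModel: Model and Entry are hypothetical, and is_dead_unguarded exists only to show the failure the guard avoids.

```python
import weakref


class Model:
    """Hypothetical stand-in for a real model object."""


class Entry:
    """Hypothetical entry whose real_model is plain None until load()."""

    def __init__(self):
        self.real_model = None

    def load(self, model):
        self.real_model = weakref.ref(model)

    def is_dead_unguarded(self):
        # Before load() this evaluates None(), which raises TypeError.
        return self.real_model() is None

    def is_dead(self):
        rm = self.real_model          # read the attribute once
        # Not loaded yet is "not dead"; loaded with a vanished target is "dead".
        return rm is not None and rm() is None
```

An entry that was never loaded reports False rather than raising, and an entry whose target has been collected reports True.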

Pin the patcher (model = shift_model.model; continue if None) in both free_memory
loops so a reentrant cleanup_models, popping current_loaded_models mid-iteration,
cannot leave shift_model.model None for the is_dynamic()/model_unload path. Remove
unloaded entries by identity (model is None post-unload, so LoadedModel.__eq__ is
unsafe), tolerating an entry a finalizer already popped.

Guard the real_model deref in is_dead() and cleanup_models(): real_model is the
plain None from __init__ until model_load(), so an entry observed pre-load makes
real_model() a None() call. cleanup_models() also removes by identity, mirroring
free_memory, and tolerates reentrant pops.
@liminfei-amd liminfei-amd force-pushed the amd-rocm/14365-free-memory-race branch 2 times, most recently from 3e8ef63 to 8436b15 Compare June 18, 2026 09:06
@liminfei-amd liminfei-amd force-pushed the amd-rocm/14365-free-memory-race branch from 8436b15 to 3e8ef63 Compare June 18, 2026 09:27
Development

Successfully merging this pull request may close these issues.

free_memory() race: cleanup_models finalizer mutates current_loaded_models mid-loop → IndexError / 'NoneType' has no attribute 'is_dynamic'

3 participants